The design and implementation of a 1-dimensional median filter in VLSI are presented. The device
is designed to operate on 8-bit sample sequences with a window size of 5 samples. Extensive
pipelining and employment of systolic concepts at the bit level enable the chip to filter at rates up to
10 Mega-samples per second. The chip is designed to be implemented with a λ = 2.5 µm NMOS
technology and is 6.2 mm by 5.0 mm in size. A circuit configuration for using the chip in
approximate 2-D median filtering is also presented.

1. Introduction

Median filtering is a nonlinear signal smoothing operation in which the median of a window of size
w = 2n + 1 replaces the sample at the middle of the window. Medians computed in this way tend to follow
the polynomial trends in the original sequence while sharp discontinuities of short duration are filtered out.
Further properties of median filtering have been described in [1], while [2] describes its application to speech
processing. Recently, an algorithm for real-time median filtering has been presented in [3]. Systolic
algorithms for one- and multi-dimensional median-filtering operations and the more general case of
computing running-order statistics have been recently proposed by Fisher [4].
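
As an illustration of the operation itself (not of the chip), the following minimal Python sketch computes a running median with window size w = 2n + 1. The function name `median_filter_1d` and the edge handling (only positions where the whole window fits are filtered) are assumptions of this sketch, not part of the design described here.

```python
from statistics import median

def median_filter_1d(samples, n=2):
    """Running median over a sliding window of size w = 2n + 1.

    Assumption of this sketch: only positions where the full window fits
    are produced, so the output is w - 1 samples shorter than the input.
    """
    w = 2 * n + 1
    return [median(samples[i:i + w]) for i in range(len(samples) - w + 1)]

# A slow ramp with one short spike (200) in the middle.
signal = [10, 11, 12, 13, 200, 15, 16, 17, 18]
print(median_filter_1d(signal))  # [12, 13, 15, 16, 17]
```

The ramp is followed while the single-sample spike never appears in the output, which is the behavior described above for short discontinuities.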

This work presents the design and implementation of a VLSI chip for the 1-dimensional median-filtering
operation. The device is designed to operate on 8-bit sample sequences with a window size of 5 samples.
Extensive pipelining and employment of systolic concepts at the bit level enable the chip to have a very high
throughput, i.e., the chip can be clocked at rates up to 10 MHz and produce one median every clock cycle after
an initial delay to fill the pipeline. The chip is designed to operate as a shift register in a system environment,
filtering data coming from the source before it goes into the actual computing system.

2.  Systolic  Algorithms  and  Structures 

Rapidly advancing VLSI technology offers system designers a very high potential for parallel operations.
However, in order to exploit this potential, algorithms to be implemented with VLSI computing structures
should have regular and simple communication schemes. This is mainly due to the fact that communication,
especially irregular communication, is costly in VLSI in terms of the chip area that communication channels
(i.e., wires) occupy. Furthermore, to reduce the design time, these algorithms should employ a rather small
number of basic building blocks (or cells) from which larger systems can be built.

A class of parallel algorithms that exhibit such regular structures are systolic algorithms. Systolic algorithms
for various computational problems have been described in [4, 5, 6, 7, 8]. Systolic data structures for priority
queue operations and connectivity problems have been proposed in [9] and in [10] respectively. The general
architectural principles of systolic computation systems have been discussed by Kung in [11]. In general,
systolic algorithms and the underlying hardware structures implementing them have very regular
neighbor-to-neighbor communication schemes. They utilize their inputs many times through pipelining and
multi-directional data flow and hence do not make heavy bandwidth demands on system memories.

Employment of systolic concepts at the low-level implementation of logic circuits for various simple
functions (like addition and comparison) also leads to regular structures that have small propagation delays
(independent of the size of the circuit) and require no broadcasting. Such circuits are suitable as building
blocks in higher-level pipelined structures. A previous chip employing such concepts is the pattern matching
chip described in [12].

In the last few years various special purpose chips employing systolic algorithms have been designed at
Carnegie-Mellon University. These include a pattern matching chip [12], an image processing chip [13], and a
tree processor for database applications [14].

3. The 1-Dimensional Median-Filtering Algorithm

The 1-D median-filtering algorithm implemented differs from the one described in [4] in the sense that it
uses the odd/even-transposition sort [15, 16, 6] as the high-level algorithm and exploits systolic data flow
concepts at the bit level to achieve a very high throughput. After an initial delay to fill the pipeline, the chip
can  produce  one  median  over  a  sliding  5-wide  window  at  every  clock  period.  The  logic  design  enables  the  use 
of  a  clock  period  that  is  long  enough  to  cover  the  propagation  delays  of  five  NMOS  gates.  However,  due  to 
technological  limitations,  the  method  employed  is  suitable  only  for  small  window  sizes  (3  to  7)  because  the 
network  implementing  the  pipelined  odd/even-transposition  sort  requires  area  proportional  to  the  square  of 
the  window  size.  The  systolic  algorithms  presented  in  [4]  require  area  linear  in  window  size  but  they  need 
more  complex  circuits. 

3.1. High-Level Structure of the Algorithm

At  the  high  level,  the  algorithm,  and  hence  the  underlying  hardware  that  implements  it,  consists  of  an  input 
stage  which  generates  the  successive  window  elements  from  the  incoming  sample  stream,  and  a  pipelined  sort 
stage  which  performs  the  odd/even-transposition  sort  on  the  elements  of  successive  windows  (see  Fig.  3-1). 
Shamos, in [17], has proposed similar circuits for median finding; in fact, a circuit proposed there for a
window of size 5 uses fewer comparators than the circuit presented here, but Shamos' circuit structure is
not regular.

The  input  stage  is  basically  a  shift  register.  At  every  clock,  it  reads  one  sample  value  from  the  input  and 
discards  the  sample  value  read  five  clocks  earlier.  This  effectively  slides  the  window  of  the  filter  over  the 
incoming  sample  stream.  Hence,  a  new  window  is  presented  to  the  odd/even-transposition  sort  network  at 
every clock.

The odd/even-transposition sort network is a pipelined structure consisting of compare-and-swap stages
that operate on even and odd pairs of window elements alternately. Five such alternating stages implement
the odd/even-transposition sort for five sample values. Since the sample values in a window pass through one
stage of the odd/even-transposition sort network in one clock, it is possible to pipeline the sorting of
successive windows through the network.

Figure 3-1: High-level structure of the algorithm

Each stage of the odd/even-transposition sort network consists of two 8-bit compare-and-swap units
(⌊window size / 2⌋ units in general) and one delay element to store the window sample value that does not get
compared at that stage, due to the fact that the window size is odd.

Each  8-bit  compare-and-swap  unit  compares  the  pair  of  8-bit  numbers  at  its  input  and  interchanges  them  if 
necessary  so  that  the  larger  of  the  numbers  is  at  the  "top".  At  the  output  of  the  last  stage,  the  window 
elements will be sorted such that the largest will be at the "top".
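
The high-level structure described in this section can be sketched functionally in Python. This is a behavioral model only, under stated assumptions: the names `sort_stage`, `window_median` and `pipelined_median_filter` are hypothetical, the choice of which pair parity is compared in the first stage is assumed, and one list per stage stands in for the stage latches.

```python
from itertools import permutations

def compare_and_swap(a, b):
    """Return the pair with the larger value first (at the "top")."""
    return (a, b) if a >= b else (b, a)

def sort_stage(window, stage):
    """One sort stage: compare-and-swap adjacent pairs starting at index
    stage % 2. The unpaired element passes through (the delay element)."""
    out = list(window)
    for i in range(stage % 2, len(out) - 1, 2):
        out[i], out[i + 1] = compare_and_swap(out[i], out[i + 1])
    return out

def window_median(window):
    """Apply len(window) alternating stages and take the middle output."""
    for stage in range(len(window)):
        window = sort_stage(window, stage)
    return window[len(window) // 2]

def pipelined_median_filter(samples, w=5):
    """Clock-by-clock model: a new window enters every clock and each
    window advances through one sort stage per clock."""
    shift_register = []      # input stage: the last w samples read
    latches = [None] * w     # latches[s]: window at the output of stage s
    medians = []
    for x in list(samples) + [None] * w:   # w extra clocks drain the pipeline
        if latches[-1] is not None:
            medians.append(latches[-1][w // 2])
        for s in range(w - 1, 0, -1):      # advance windows, last stage first
            prev = latches[s - 1]
            latches[s] = None if prev is None else sort_stage(prev, s)
        if x is not None:                  # shift in, drop the oldest sample
            shift_register = (shift_register + [x])[-w:]
        full = x is not None and len(shift_register) == w
        latches[0] = sort_stage(shift_register, 0) if full else None
    return medians

# Five stages sort every ordering of five distinct values.
assert all(window_median(list(p)) == 2 for p in permutations(range(5)))
print(pipelined_median_filter([10, 11, 12, 13, 200, 15, 16, 17, 18]))
# [12, 13, 15, 16, 17]
```

Once the pipeline is full, the model emits one median per loop iteration, mirroring the one-median-per-clock behavior of the network.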

3.2.  Hardware  Implementation  of  the  Algorithm 

The structure of the odd/even-transposition sort network described above has certain undesirable
characteristics if directly mapped into hardware. In the compare-and-swap units, the swapping of the inputs
can only be done after the result of the entire 8-bit comparison has been computed. However, this requires
waiting for a long propagation delay through 8 stages of bitwise comparators.

It is possible to get rid of this propagation delay by employing systolic concepts at the bit level. This
involves breaking up the compare-and-swap operation into steps and then distributing them over time by
skewing the bits of the numbers being compared with delay elements so that each pair of bits arrives at its
bitwise comparator at the same time as the subresult of the comparison of the more significant bits
arrives.

The basic element used to implement the compare-and-swap operation is the bitwise compare-and-swap unit.
The functional description of this unit is given in Fig. 3-2. It is a bit comparator followed by two multiplexers
which pass the larger of the inputs to the A output and the smaller to the B output, if E_in is asserted.
Otherwise it unconditionally swaps or passes the inputs depending on whether L_in is asserted or not. It also
passes "downward" the cumulative subresult of the comparison to the less significant stages.

Figure 3-2: The basic compare-and-swap unit (outputs labeled L_out and E_out)

The 8-bit wide compare-and-swap units implemented with the units described above also distribute the
swap operation over time along with the comparisons. So at the end of each bitwise compare-and-swap
operation, the outputs of the bit compare-and-swap units will be the same as they would be if all the bitwise
swaps were done simultaneously after waiting for the final comparison result. This is easy to see if we note
that if a less significant compare-and-swap unit decides that all the input bits should be swapped, then the
inputs to the more significant stages must have been equal; hence passing them without swapping does not
matter.
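
The equivalence argued above can be checked with a small behavioral model. The sketch below is an illustration under assumptions: the signal letters E and L follow the labels in the figure, but their exact meaning here (E = "all more significant bits were equal", L = "a swap has been decided") and the function names `bit_cell` and `compare_and_swap_bits` are assumptions of this sketch. The loop from the most significant bit downward stands in for the skew in time.

```python
def bit_cell(a, b, e_in, l_in):
    """One bitwise compare-and-swap cell (behavioral model).

    Returns (top, bottom, e_out, l_out). Signal meanings are assumed:
    e_* = more significant bits were all equal, l_* = swap decided.
    """
    if e_in:
        # Undecided so far: this bit pair is compared and ordered.
        top, bottom = max(a, b), min(a, b)
        e_out = int(a == b)
        l_out = int(a < b)
    else:
        # Already decided by a more significant bit: swap or pass.
        top, bottom = (b, a) if l_in else (a, b)
        e_out, l_out = 0, l_in
    return top, bottom, e_out, l_out

def compare_and_swap_bits(x, y, k=8):
    """Compare-and-swap two k-bit numbers one bit per step, MSB first."""
    e, l = 1, 0                      # nothing decided before the MSB
    top = bottom = 0
    for i in range(k - 1, -1, -1):   # step i models one clock of skew
        a, b = (x >> i) & 1, (y >> i) & 1
        t, u, e, l = bit_cell(a, b, e, l)
        top = (top << 1) | t
        bottom = (bottom << 1) | u
    return top, bottom

print(compare_and_swap_bits(5, 6))   # (6, 5)
```

Bits that pass through before the decision is made are equal in both operands, so the bit-serial result matches a swap performed after the full comparison; an exhaustive check over all pairs of 8-bit values confirms this for the model.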

The implementation of the odd/even-transposition sort network exploits the observations presented above.
Furthermore, the comparisons of the next stage of the odd/even-transposition sort network can be started
immediately once the comparisons and swaps of the first bits of the preceding stage are done. This
observation leads to an internal block structure of the odd/even-transposition sort network given in Fig. 3-3. It
should also be noted that in the resulting structure, the first bits of the sorted outputs are available even
before the comparisons of the first stage of the odd/even-transposition sort network are completed.

4.  The  Chip 

The chip employing the method described in the preceding section has been designed to be implemented
with an NMOS process with λ of 2.5 microns. The basic methodology and the design rules presented in [18]
have been used throughout the design and layout process. The outline of the floor plan of the resulting chip is
given in Fig. 4-1. The dimensions of the chip are approximately 6.2 mm by 5.0 mm.

It uses 21 pins: 8 for input, 8 for output, 2 for the two phases of the clock, 1 for VDD, 1 for Ground and 1
for the substrate bias; hence it can be packaged in a 24 pin package.

As of this writing, the chip has been laid out completely and design-rule checks have been made.
Circuit-level simulations of the circuits making up the sort network have been done. Currently the chip is being
fabricated by the ARPA facility coordinated by USC-ISI.

5.  Application  to  2-Dimensional  Image  Processing 

Although the design is not directly applicable to the 2-dimensional median filtering operation, a cascade
configuration using these chips can be used for approximate median filtering of 2-D images as suggested by
Shamos [17]. The basic idea is to find the medians of the rows of an n × n window and then compute the
median of the medians. It is shown in [17] that A_m, the median of the medians of such a window, has the
property that

rank(A_m) ≥ (n² + 2n + 1) / 4

and

rank(A_m) ≤ (3n² − 2n + 3) / 4

This result indicates that such a configuration is guaranteed to filter out the upper and lower quartiles of
the samples in the window. Simulation results obtained by Shamos also indicate that for n = 5,
Prob(rank(A_m) = 13) ≈ 0.3900 and Prob(12 ≤ rank(A_m) ≤ 14) ≈ 0.7200.
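
The two rank bounds can be checked numerically. The following Python sketch evaluates them for a given odd n and tests random n × n windows against them; it is only a sanity check under assumptions (distinct sample values, rank 1 for the smallest sample, hypothetical function names) and does not attempt to reproduce the simulation figures quoted from [17].

```python
import random
from statistics import median

def rank_bounds(n):
    """Lower and upper bound on the rank of the median of row medians
    in an n x n window (n odd), as given by the two inequalities."""
    lower = (n * n + 2 * n + 1) // 4
    upper = (3 * n * n - 2 * n + 3) // 4
    return lower, upper

def median_of_row_medians(window):
    return median([median(row) for row in window])

def rank_in_window(window, value):
    """Rank among all window samples, 1 = smallest (distinct values assumed)."""
    return sorted(v for row in window for v in row).index(value) + 1

def check_bounds(n=5, trials=2000, seed=0):
    """Return True if every random window respects the bounds."""
    rng = random.Random(seed)
    lower, upper = rank_bounds(n)
    for _ in range(trials):
        values = rng.sample(range(10_000), n * n)   # distinct samples
        window = [values[r * n:(r + 1) * n] for r in range(n)]
        rank = rank_in_window(window, median_of_row_medians(window))
        if not lower <= rank <= upper:
            return False
    return True

print(rank_bounds(5))   # (9, 17)
print(check_bounds())   # True
```

For n = 5 the bounds are 9 and 17, placed symmetrically around the true median rank of 13 in a window of 25 samples.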

Figure 3-3: Internal structure of the odd/even-transposition sort network

The 1-D median-filter chip can be used in the configuration given in Fig. 5-1 to implement the
approximate 2-D median filtering. This configuration operates in the following way: the 1-D median filters
on the left filter the rows of the 5 × 5 window sliding over the rows and output the medians of the rows
skewed in time. The multiplexers serialize the parallel incoming medians into the 1-D median filters on the
right and also pass the select code coming from the upper multiplexers or the counter to the next multiplexer
below. The median filters on the right operate on skewed window outputs in parallel, computing the medians
of the medians of the rows. However, they generate one result every 5 time steps. Finally, the last multiplexer
selects and outputs the approximate medians (A_m) coming out of the median filters. It can be noted that this
configuration can filter at a rate of 50 Mega-samples per second.
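
Functionally, the cascade computes a row-wise running median followed by a median across n vertically adjacent row medians. The Python sketch below models only that input/output behavior; the multiplexers, the select codes, and the time skew are not modeled, and the function name `approx_median_filter_2d` and the border handling (only fully covered window positions are produced) are assumptions of this sketch.

```python
from statistics import median

def approx_median_filter_2d(image, n=5):
    """Median of row medians over every fully covered n x n window."""
    rows, cols = len(image), len(image[0])
    # First rank of filters: running median along each image row.
    row_medians = [[median(r[c:c + n]) for c in range(cols - n + 1)]
                   for r in image]
    # Second rank: median of n vertically adjacent row medians.
    return [[median([row_medians[r + i][c] for i in range(n)])
             for c in range(cols - n + 1)]
            for r in range(rows - n + 1)]

# A flat 6 x 6 image with one impulse.
image = [[7] * 6 for _ in range(6)]
image[2][3] = 255
print(approx_median_filter_2d(image))  # [[7, 7], [7, 7]]
```

The isolated impulse is removed in every window that contains it, as expected from a filter whose output rank is bounded away from the extremes.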

6.  Evaluation  and  Conclusions 

The  design  and  implementation  of  a  VLSI  chip  for  performing  the  1-D  median-filtering  operation  has  been 
presented.  The  major  motivation  for  this  work  has  been  to  apply  systolic  concepts  at  the  bit  level  in  the 
implementation of logic circuits to construct a digital system with a very high throughput. Also, application of
the  developed  chip  to  2-D  image  processing  has  been  investigated  and  a  configuration  for  employing  it  in 
approximate  2-D  median  filtering  has  been  proposed. 

Although the design developed in this work has a very high throughput, the response time is k + w where
k  is  the  number  of  bits  in  each  sample  and  w  is  the  window  size  (so  the  response  time  for  this  specific 
implementation  is  13  clock  periods).  Furthermore,  the  design  is  not  practical  for  larger  window  sizes  because 
the  silicon  area  for  implementing  the  odd/even-transposition  sort  network  grows  as  the  square  of  the  window 
size. 
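
The two scaling statements above can be made concrete with a little arithmetic. The sketch below assumes that each of the w sort stages holds ⌊w/2⌋ compare-and-swap units of k bitwise cells each, as described in Section 3, and ignores delay elements and the input shift register; the function names are hypothetical.

```python
def response_time(k, w):
    """Response time in clock periods, k + w, as stated in the text."""
    return k + w

def bit_cell_count(k, w):
    """Bitwise compare-and-swap cells: w stages x floor(w/2) units x k bits.
    Assumption: delay elements and the input stage are not counted."""
    return w * (w // 2) * k

for w in (3, 5, 7, 9):
    print(w, response_time(8, w), bit_cell_count(8, w))
# 3 11 24
# 5 13 80
# 7 15 168
# 9 17 288
```

Under these assumptions the response time grows only linearly with the window size, while the cell count grows roughly as w²k/2, which is the quadratic area growth noted above.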